SAS Foundation Interfaces for Hadoop
| FILENAME statement | Allows the DATA step to read and write HDFS data files. | Base SAS |
| PROC HADOOP | Copies or moves files between SAS and Hadoop, executes MapReduce and Pig code, and executes Hadoop file system commands to manage files and directories. | Base SAS |
| SQL Pass-Through | Submits HiveQL queries and other HiveQL statements from SAS directly to Hive for processing. Query results are returned to SAS. | SAS/ACCESS Interface to Hadoop |
| LIBNAME statement for Hadoop | Accesses Hive tables as SAS data sets using the SAS programming language. The SAS/ACCESS engine translates SAS language into HiveQL and attempts to push the processing to Hive before returning results to SAS. | SAS/ACCESS Interface to Hadoop |
SAS In-Memory Analytics Interfaces for Hadoop
- Hadoop is now one of the file storage systems that SAS uses for its In-Memory Analytics solutions:
- SAS High-Performance Analytics products
- SAS Visual Analytics
- SAS In-Memory Statistics
- SAS Code Accelerator for Hadoop (DS2)
A Hadoop Cluster to Run SAS
Base SAS: FILENAME for Hadoop
Base SAS: The Same Process for PROC HADOOP
SAS/ACCESS: SQL Pass-Through and LIBNAME
Managing Files and Executing Hadoop Commands, Map-Reduce, and Pig
- Part 1: Introduction to Base SAS Methods for Hadoop
- Part 2: The HADOOP FILENAME Statement and PROC HADOOP
- Part 3: Executing Pig Code With PROC HADOOP
Part 1: Introduction to Base SAS Methods for Hadoop
The Hadoop Config.xml File
- The FILENAME statement for Hadoop and PROC HADOOP require an option that specifies a Hadoop configuration file (config.xml).
- The configuration file defines how to connect to Hadoop.
- The file must be accessible to the SAS client application.
- A SAS administrator commonly manages this configuration for the SAS users.
- This file is often referred to as the Hadoop core-site.xml file.
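A minimal sketch of what such a configuration file can contain (the host name and port are placeholders; the exact properties required depend on the cluster and its Hadoop distribution):

```xml
<configuration>
  <!-- Tells the client where the HDFS NameNode is listening -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
```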
Hadoop JAR Files
- A collection of Hadoop JAR files is also required on the SAS client machine.
- The SAS_HADOOP_JAR_PATH environment variable on the SAS client machine defines the location of the Hadoop JAR files.
- The Hadoop JAR files must be compatible with the specific Hadoop implementation and can be copied from the Hadoop server.
- A Hadoop system administrator commonly manages the configuration of the Hadoop JAR files for the SAS users.
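On a UNIX SAS client, the environment variable might be set like this before SAS starts (the directory path is an illustrative placeholder):

```shell
# Point SAS at the directory holding the cluster's client JAR files
# (the path below is an illustrative placeholder).
export SAS_HADOOP_JAR_PATH=/opt/sas/hadoopjars
```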
Base SAS Interface to Hadoop
Part 2: The HADOOP FILENAME Statement and PROC HADOOP
Business Scenario
- We want to develop a prototype for a process that uses SAS to orchestrate the following scenario:
- Move unstructured text files into the Hadoop file system.
- Invoke MapReduce programs developed by Java programmers in order to:
- read and process the text files to perform various analyses
- output results as text files in the Hadoop file system
- Read the summarized text analysis results back into SAS for further analysis and reporting purposes.
Breakdown 1
- Move unstructured text files into the Hadoop file system.
- PROC HADOOP and the HDFS COPYFROMLOCAL statement
Breakdown 2
- Invoke MapReduce programs developed by Java programmers in order to:
- read and process the text files to perform various analyses
- output results as text files in the Hadoop file system
- PROC HADOOP and the MAPREDUCE statement
Breakdown 3
- Read the summarized text analysis results back into SAS for further analysis and reporting purposes.
- FILENAME statement for Hadoop and DATA step
Business Scenario Pseudocode
proc hadoop...;
hdfs copyfromlocal="local file" out="hdfs file";
run;
proc hadoop...;
mapreduce input="hdfs file" output="hdfs outfile" ...;
run;
filename fileref "hdfs outfile"...;
data somedata;
infile fileref ...;
input ...;
run;
Breakdown 1
- Move unstructured text files into the Hadoop file system
proc hadoop...;
hdfs copyfromlocal="local file" out="hdfs file";
run;
Breakdown 2
- Input the HDFS file to the MapReduce program and output results to an HDFS file
proc hadoop...;
mapreduce input="hdfs file" output="hdfs outfile" ...;
run;
Breakdown 3
- Read the MapReduce output with SAS for further processing
filename fileref "hdfs outfile"...;
data somedata;
infile fileref ...;
input ...;
run;
PROC HADOOP
- PROC HADOOP submits:
- Hadoop file system (HDFS) commands
- MapReduce programs
- PIG language code
PROC HADOOP <Hadoop-server-option(s)>;
HDFS <Hadoop-server-option(s)> <hdfs-command-option(s)>;
MAPREDUCE <Hadoop-server-option(s)> <mapreduce-option(s)>;
PIG <Hadoop-server-option(s)> <pig-code-option(s)>;
PROPERTIES <configuration-properties>;
RUN;
HDFS Statements
HDFS COPYFROMLOCAL="local-file" OUT="output-location" <DELETESOURCE> <OVERWRITE>;
HDFS COPYTOLOCAL="HDFS-file" OUT="output-location" <DELETESOURCE> <OVERWRITE> <KEEPCRC>;
HDFS DELETE="HDFS-file" <NOWARN>;
HDFS MKDIR="HDFS-path";
HDFS RENAME="HDFS-file" OUT="new-name";
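The other HDFS statements follow the same pattern; a sketch combining several of them (the hadconfg fileref, the &std macro variable, and all paths are illustrative assumptions matching the examples elsewhere in this section):

```sas
proc hadoop options=hadconfg username="&std" verbose;
   /* create an archive directory in HDFS */
   hdfs mkdir="/user/&std/archive";
   /* move a file into it under a new name */
   hdfs rename="/user/&std/data/results.txt"
        out="/user/&std/archive/results.txt";
   /* remove a scratch directory; NOWARN suppresses the
      warning if it does not exist */
   hdfs delete="/user/&std/tempdir" nowarn;
run;
```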
Moving a File from SAS to Hadoop
- This program creates a directory in the Hadoop file system (HDFS) and copies a file from the SAS server to the new HDFS directory.
filename hadconfg "/work/hadoop_config.xml";
proc hadoop options=hadconfg username="&std" verbose;
hdfs mkdir="/user/&std/data";
hdfs copyfromlocal="/work/DIACCHAD/data/moby_dick_via_sas.txt" out="/user/&std/data";
run;
Execute MapReduce Code
MAPREDUCE <Hadoop-server-option(s)> <mapreduce-option(s)>;
proc hadoop options=hadconfg username="&std";
mapreduce
input="source-file"
output="target-file"
jar="jar file containing MapReduce code"
outputkey="output key class (in MapReduce code)"
outputvalue="output value class"
reduce="reducer class"
combine="combiner class"
map="map class";
run;
Breakdown 1
input="source-file"
- The input file in HDFS that the MapReduce program reads
output="target-file"
- The output file in HDFS that the MapReduce program writes to
Breakdown 2
jar="jar file containing MapReduce code"
- The JAR file containing the MapReduce program and named classes
Breakdown 3
outputkey="output key class (in MapReduce code)"
- The name of the output key class in dot notation
outputvalue="output value class"
- The name of the output value class in dot notation
Breakdown 4
reduce="reducer class"
combine="combiner class"
map="map class";
- The Java classes in the MapReduce program that execute the map, reduce, and combine steps
MapReduce Example
- In the demonstration, PROC HADOOP will be used to invoke a MapReduce program that will do the following:
- read a text file containing free, unstructured text; the file can be distributed across multiple data nodes in Hadoop
- parse the text into the individual words in each data node
- count up the number of instances of each unique word found in each data node
- combine total counts for each unique word across nodes
- write the results to an HDFS output file
proc hadoop options=hadconfg username="&std" verbose;
mapreduce
jar = "<hdfs path>/hadoop-mr1-cdh.jar"
input = "<hdfs path>/moby_dick_via_sas.txt"
map = "org.apache.hadoop.examples.WordCount$TokenizerMapper"
reduce = "org.apache.hadoop.examples.WordCount$IntSumReducer"
combine = "org.apache.hadoop.examples.WordCount$IntSumReducer"
outputkey = "org.apache.hadoop.io.Text"
outputvalue = "org.apache.hadoop.io.IntWritable"
output = "<hdfs path>/mapoutput"
;
run;
Breakdown 1
- Read a text file containing free unstructured text
input = "<hdfs path>/moby_dick_via_sas.txt"
Breakdown 2
- In parallel in each data node, parse the text into the individual words
map = "org.apache.hadoop.examples.WordCount$TokenizerMapper"
Breakdown 3
- In each data node, in parallel, count up the number of instances of each unique word found
reduce = "org.apache.hadoop.examples.WordCount$IntSumReducer"
Breakdown 4
- Combine the counts for each unique word across data nodes to find final counts
combine = "org.apache.hadoop.examples.WordCount$IntSumReducer"
Breakdown 5
- The output contains each unique word (outputkey) as Text and the number of times it occurred in the input file as Integer
outputkey = "org.apache.hadoop.io.Text"
outputvalue = "org.apache.hadoop.io.IntWritable"
Breakdown 6
- The HDFS location the output is written to
output = "<hdfs path>/mapoutput"
The FILENAME Statement for Hadoop
- In SAS, the FILENAME statement associates a fileref with an external file and an output device or access method.
FILENAME fileref <device type or access method> "external file" <options>;
filename in hadoop "Hadoop-file-path" concat cfg=xml-config-file user="&std";
Breakdown 3
"Hadoop-file-path"
- The directory containing the concatenated files to read
Breakdown 4
concat
- Specifies to read each file in the directory defined by the Hadoop file path
- Note: This option is valid only for reading (not writing) Hadoop files with the FILENAME statement.
Breakdown 5
cfg=xml-config-file
- Points to the location of the Hadoop configuration file on the machine where SAS is executing.
Breakdown 6
user="&std"
- The user ID to connect to Hadoop
Reading a Hadoop File with a DATA Step
filename hadconfg "/work/hadoop_config.xml";
filename mapres hadoop "/user/&std/data/mapoutput" concat cfg=hadconfg user="&std";
data work.commonwords;
infile mapres dlm="09"x;
input word $ count;
...
run;
hadconfg <=====> cfg=hadconfg
mapres <=====> infile mapres
"09"x is the hex code constant for the tab character
Another Example
filename hadconfg "/work/hadoop_config.xml";
filename orders hadoop "/user/shared/data/custorders.txt" cfg=hadconfg user="&std";
data work.custorders;
infile orders;
input @1 customer_id 8. ...;
run;
proc print data=work.custorders;
run;
- Does the DATA step read a single HDFS file or a concatenated directory?
Base SAS: FILENAME for Hadoop (Review)
Reading a Hadoop File with a DATA Step
Writing a Hadoop File with a DATA Step
filename hadconfg "/work/hadoop_config.xml";
filename out hadoop "/user/&std/data/custord" dir cfg=hadconfg user="&std";
data _null_;
set work.custorders;
file out(corders) dlm=",";
put customer_id
country
gender
birth_date
product_id
order_date
quantity
costprice_per_unit;
run;
dir: To write to files in the directory specified by the Hadoop file path
corders: Create a file called corders in the HDFS directory "/user/&std/data/custord"
Use Ambari to Browse the Hadoop File System
Part 3: Executing Pig Code With PROC HADOOP
Executing Pig Code with PROC HADOOP
PIG CODE=fileref | "external-file"
    PARAMETERS=fileref | "external-file"
    REGISTERJAR="external-file(s)";
filename pigcode "/workshop/DIACCHAD/pigcode.txt";
proc hadoop options=hadconfg username="hdfs" verbose;
pig code=pigcode;
run;
A = LOAD '/user/shared/data/custord'
USING PigStorage (',')
AS (customer_id, country, gender, birth_date, product_id,
order_date, quantity, costprice_per_unit);
B = FILTER A BY gender == 'F';
store B into '/user/shared/data/student1';
Pig Code Breakdown 1
A = LOAD '/user/shared/data/custord'
USING PigStorage (',')
AS (customer_id, country, gender, birth_date, product_id,
order_date, quantity, costprice_per_unit);
- Load the comma delimited file and name each field
Pig Code Breakdown 2
B = FILTER A BY gender == 'F';
- Subset the loaded file for records where gender = 'F'
Pig Code Breakdown 3
store B into '/user/shared/data/student1';
- Store the results in the Hadoop file system directory indicated
Using the SQL Pass-Through Facility
SAS/ACCESS Interface to Hadoop
SQL Pass-Through Query Example
proc sql;
connect to hadoop (server=namenode port=10000 subprotocol=hive2 schema=diacchad user="&std");
select * from connection to hadoop
(select employee_name, salary
from salesstaff
where emp_hire_date between '2011-01-01' and '2011-12-31'
);
disconnect from hadoop;
quit;
- Note: The query in parentheses is sent directly to Hive and is executed as a HiveQL query by Hive.
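HiveQL statements that return no result set, such as DDL, can be sent through the pass-through facility with an EXECUTE statement instead of a SELECT; a sketch assuming the same connection options as above (the table name salesstaff_2011 is hypothetical):

```sas
proc sql;
   connect to hadoop (server=namenode port=10000 subprotocol=hive2
                      schema=diacchad user="&std");
   /* the DDL runs entirely in Hive; nothing is returned to SAS */
   execute (create table salesstaff_2011 as
              select employee_name, salary
              from salesstaff
              where emp_hire_date between '2011-01-01' and '2011-12-31')
      by hadoop;
   disconnect from hadoop;
quit;
```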
Joining Multiple Tables from Different Databases
Creating a SAS File from Hive Results
Using HiveQL DDL Statements in SAS
Using the SAS/ACCESS LIBNAME Engine
- Using the LIBNAME Statement for Hadoop
- Using Data Set Options
- Creating Views
- Combining Tables
- Transferring Data Sets from SAS to Hive
The LIBNAME Statement (Review)
- The LIBNAME statement assigns a libref to a SAS library.
LIBNAME libref 'SAS-data-library' <options>;
- Rules for naming a libref:
- The name must be eight characters or fewer.
- The name must begin with a letter or underscore.
- The remaining characters must be letters, numbers, or underscores.
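Applying those rules (the engine and connection options are elided, and the librefs themselves are illustrative):

```sas
libname sales2 hadoop ...;       /* valid: starts with a letter, 6 characters */
libname _stage hadoop ...;       /* valid: a leading underscore is allowed    */
libname 2ndqtr hadoop ...;       /* invalid: begins with a number             */
libname salesarchive hadoop ...; /* invalid: more than 8 characters           */
```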
The SAS/ACCESS LIBNAME Statement
- The SAS/ACCESS LIBNAME statement does the following:
- establishes a libref, which acts as an alias, or nickname, for Hive.
- permits a Hive table to be referenced by a two-level name.
- enables the use of the SAS/ACCESS LIBNAME statement options to specify how Hive objects are processed by SAS.
- enables you to customize how to connect to Hive.
libname hivedb hadoop server=namenode
subprotocol=hive2
port=10000 schema=diacchad
user=studentX pw=StudentX;
LIBNAME libref engine-name <connection-options>
<LIBNAME-options>;
23 libname hivedb hadoop server=namenode
24 subprotocol=hive2
25 port=10000 schema=diacchad
26 user="&std" pw="&stdpw";
NOTE: Libref HIVEDB was successfully assigned as follows:
Engine: HADOOP
Physical Name: jdbc:hive2://namenode:10000/diacchad
LIBNAME Statement Connection Options
Listing of Hive Tables in the Schema
Contents of a HiveTable
Listing of a Hive Table
SAS/ACCESS Engine Implicit Pass-through
Implicit versus Explicit SQL Pass-through
Optimizing Implicit Pass-through
The SASTRACE= SAS System Option
SASTRACE= SAS System Option 2
NOSTSUFFIX SAS System Option
SASTRACELOC= SAS System Options
SASTRACE= Messages
Using the MEANS and FREQ Procedures
Using the MEANS and FREQ Procedures 2
Using the MEANS and FREQ Procedures 3
Supported SAS Language Functions
Using a Supported SAS Function
Using a Non-Supported Function
Using SAS Data Set Options
Selected SAS Data Set Options
SASDATEFMT= SAS/ACCESS Data Set Option
SASDATEFMT= SAS/ACCESS Data Set Option 2
Creating a SAS PROC SQL View
Creating a SAS PROC SQL View 2
Using a SAS PROC SQL View
Ways to Combine Data
Passing Joins to Hive
Joining Tables from a Single Connection
Joining Tables from Multiple Connections
SQL Set Operators
Stacking Tables Using Set Operators
Combining Hive Tables with Other Sources
Combining SAS Data Set and Hive Table
Copying Data Sets to Hive